
    Extension of uncertainty propagation to dynamic MFCCs for noise robust ASR

    Uncertainty propagation has been successfully employed for speech recognition in nonstationary noise environments. The uncertainty about the features is typically represented as a diagonal covariance matrix for static features only. We present a framework for estimating the uncertainty over both static and dynamic features as a full covariance matrix. The estimated covariance matrix is then multiplied by scaling coefficients optimized on development data. We achieve a 21% relative error rate reduction on the 2nd CHiME Challenge with respect to conventional decoding without uncertainty, that is, five times more than the reduction achieved with a diagonal uncertainty covariance matrix for static features only.
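    Since delta (dynamic) features are a linear function of the static features, a diagonal static uncertainty can be propagated into a full covariance over the stacked static and dynamic features. The numpy sketch below illustrates this under assumed settings (a standard two-frame delta window, per-coefficient processing across frames); the scaling factor `alpha` stands in for the coefficients the paper optimizes on development data.

```python
import numpy as np

def delta_matrix(num_frames, window=2):
    """Linear operator computing delta features over a frame sequence,
    with edge frames replicated (HTK-style delta filter)."""
    norm = 2.0 * sum(k * k for k in range(1, window + 1))
    D = np.zeros((num_frames, num_frames))
    for t in range(num_frames):
        for k in range(1, window + 1):
            D[t, min(t + k, num_frames - 1)] += k / norm
            D[t, max(t - k, 0)] -= k / norm
    return D

num_frames = 10
sigma_static = np.diag(np.random.rand(num_frames))   # diagonal static uncertainty
A = np.vstack([np.eye(num_frames), delta_matrix(num_frames)])
sigma_full = A @ sigma_static @ A.T                  # full covariance: statics + deltas
alpha = 1.5                                          # assumed dev-set-tuned scaling
sigma_scaled = alpha * sigma_full
```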

    Component Structuring and Trajectory Modeling for Speech Recognition

    When the speech data are produced by speakers of different age and gender, the acoustic variability of any given phonetic unit becomes large, which degrades speech recognition performance. A way to go beyond the conventional Hidden Markov Model is to explicitly include speaker class information in the modeling. Speaker classes can be obtained by unsupervised clustering of the speech utterances. This paper introduces a structuring of the Gaussian components of the GMM densities with respect to speaker classes. In a first approach, the structuring of the Gaussian components is combined with speaker class-dependent mixture weights. In a second approach, the structuring is used with mixture transition matrices, which add dependencies between Gaussian components of mixture densities (as in stranded GMMs). The different approaches are evaluated and compared in detail on the TIDIGITS task. Significant improvements are obtained using the proposed approaches based on structured components. Additional results are reported for phonetic decoding on the NEOLOGOS database, a large corpus of French telephone data.
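    A toy sketch of the first approach, with assumed names and shapes: the Gaussian components are shared across speaker classes, but each class has its own mixture weights, so the same observation is scored differently depending on the speaker class.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
num_classes, num_comp, dim = 3, 4, 2
means = rng.normal(size=(num_comp, dim))        # shared Gaussian components
covs = np.stack([np.eye(dim)] * num_comp)
# one mixture-weight vector per speaker class (rows sum to 1)
class_weights = rng.dirichlet(np.ones(num_comp), size=num_classes)

def density(x, speaker_class):
    """Mixture density with class-dependent weights over shared components."""
    w = class_weights[speaker_class]
    return sum(w[c] * multivariate_normal.pdf(x, means[c], covs[c])
               for c in range(num_comp))

x = rng.normal(size=dim)
print([density(x, k) for k in range(num_classes)])
```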

    Combining lexical and prosodic features for automatic detection of sentence modality in French

    This article analyzes the automatic detection of sentence modality in French using both prosodic and linguistic information. The goal is to later use such an approach as a support for helping communication with deaf people. Two sentence modalities are evaluated: questions and statements. As linguistic features, we considered the presence of discriminative interrogative patterns and two log-likelihood ratios of the sentence being a question rather than a statement: one based on words and the other based on part-of-speech tags. The prosodic features are based on duration, energy and pitch features estimated over the last prosodic group of the sentence. The evaluations consider linguistic features stemming either from manual transcriptions or from an automatic speech transcription system. The behavior of various sets of features is analyzed and compared. The combination of linguistic and prosodic features gives a slight improvement on automatic transcriptions, where the correct classification performance reaches 72%.
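    A minimal sketch of one of the linguistic features described above: the log-likelihood ratio of a sentence under a question language model versus a statement language model. Add-one-smoothed unigrams and toy data are used here for brevity; the word- and part-of-speech-based ratios in the paper follow the same pattern with real n-gram models.

```python
import math
from collections import Counter

questions = ["est ce que tu viens", "tu viens quand"]
statements = ["je viens demain", "il fait beau"]

def unigram_logprob(sentence, corpus):
    """Add-one-smoothed unigram log-probability of a sentence."""
    counts = Counter(w for s in corpus for w in s.split())
    total, vocab = sum(counts.values()), len(counts) + 1
    return sum(math.log((counts[w] + 1) / (total + vocab))
               for w in sentence.split())

def llr(sentence):
    return unigram_logprob(sentence, questions) - unigram_logprob(sentence, statements)

print(llr("tu viens demain"))  # > 0 leans question, < 0 leans statement
```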

    Impact of frame rate on automatic speech-text alignment for corpus-based phonetic studies

    Phonetic segmentation is the basis for many phonetic and linguistic studies. As manual segmentation is a lengthy and tedious task, automatic procedures relying on acoustic Hidden Markov Models have been developed over the years. Many studies have been conducted, and refinements developed, for corpus-based speech synthesis, where the technology is mainly used in a speaker-dependent context and applied to good-quality speech signals. In a different research direction, automatic speech-text alignment is also used for phonetic and linguistic studies on large speech corpora. In this case, speaker-independent acoustic models are mandatory, and the speech quality may not be as good. The speech models rely on a 10 ms shift between acoustic frames, and their topology leads to strong minimum duration constraints. This paper focuses on the acoustic analysis frame rate, and gives a first insight into the impact of the frame rate on corpus-based phonetic studies.
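    The minimum duration constraint mentioned above follows directly from the HMM topology: a model with S emitting states must emit at least one frame per state, so each phone segment lasts at least S times the frame shift. A small illustration:

```python
def min_phone_duration_ms(num_emitting_states, frame_shift_ms):
    """A phone HMM must emit at least one frame per emitting state."""
    return num_emitting_states * frame_shift_ms

for shift_ms in (10, 5):
    print(f"{shift_ms} ms frame shift -> minimum phone duration "
          f"{min_phone_duration_ms(3, shift_ms)} ms")
# 10 ms shift -> 30 ms minimum; 5 ms shift -> 15 ms minimum
```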

    Adding new words into a language model using parameters of known words with similar behavior

    This article presents a study on how to automatically add new words into a language model without retraining or adapting it (which would require a lot of new data). The proposed approach consists in finding a list of similar words for each new word to be added to the language model. Based on a small set of sentences containing the new words and on a set of n-gram counts containing the known words, we search for the known words whose neighbor distribution (over the few preceding and following words) is most similar to that of the new words. The similar words are determined by computing KL divergences between the distributions of neighbor words. The n-gram parameter values associated with the similar words are then used to define the n-gram parameter values of the new words. In the context of speech recognition, the performance assessment on an LVCSR task shows the benefit of the proposed approach.
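    A toy sketch of the similarity search described above, with illustrative data: build the neighbor-word distribution (immediately preceding and following words) for the new word and for known candidates, then rank the candidates by KL divergence; the closest known word donates its n-gram parameters.

```python
import math
from collections import Counter

def neighbor_dist(word, sentences):
    """Distribution of the words immediately before and after `word`."""
    counts = Counter()
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t == word:
                if i > 0:
                    counts[toks[i - 1]] += 1
                if i + 1 < len(toks):
                    counts[toks[i + 1]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q, floor=1e-8):
    return sum(pw * math.log(pw / q.get(w, floor)) for w, pw in p.items())

new_sents = ["the smartphone rang loudly", "she bought a smartphone"]
known_sents = ["the phone rang loudly", "she bought a phone",
               "the car stopped here", "she bought a car"]
p_new = neighbor_dist("smartphone", new_sents)
best = min(["phone", "car"],
           key=lambda w: kl_divergence(p_new, neighbor_dist(w, known_sents)))
print(best)  # -> "phone": its n-gram parameters would seed the new word
```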

    Combinaison de mots et de syllabes pour transcrire la parole

    Combining words and syllables for speech transcription. This paper analyzes the use of hybrid language models for automatic speech transcription. The goal is to later use such an approach as a support for helping communication with deaf people, and to run it on an embedded decoder on a portable device, which introduces constraints on the model size. The main linguistic units considered for this task are words and syllables. Various lexicon sizes are studied by setting thresholds on the word occurrence frequencies in the training data, the less frequent words therefore being syllabified. Using this kind of language model, the recognizer can output between 69% and 96% of the words (the remaining words being represented by syllables). By setting different thresholds on the confidence measures associated with the recognized words, the most reliable word hypotheses can be identified; they have correct recognition rates between 70% and 92%.

    Hybrid language models for speech transcription

    This paper analyzes the use of hybrid language models for automatic speech transcription. The goal is to later use such an approach as a support for helping communication with deaf people, and to run it on an embedded decoder on a portable device, which introduces constraints on the model size. The main linguistic units considered for this task are words and syllables. Various lexicon sizes are studied by setting thresholds on the word occurrence frequencies in the training data, the less frequent words therefore being syllabified. A recognizer using this kind of language model can output between 62% and 96% of words (depending on the threshold on the word occurrence frequencies; the other recognized lexical units are syllables). By setting different thresholds on the confidence measures associated with the recognized words, the most reliable word hypotheses can be identified; they have correct recognition rates between 70% and 92%.
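    A minimal sketch of the hybrid lexicon construction used in this and the preceding paper, with a placeholder syllabifier (the actual systems syllabify the phonetic form of rare French words):

```python
from collections import Counter

def syllabify(word):
    # placeholder: pretend two-character chunks are syllables
    return [word[i:i + 2] for i in range(0, len(word), 2)]

def build_hybrid_units(corpus_tokens, min_count):
    """Keep frequent words as word units; back off rare words to syllables."""
    counts = Counter(corpus_tokens)
    units = []
    for tok in corpus_tokens:
        if counts[tok] >= min_count:
            units.append(tok)             # frequent word: keep as a word unit
        else:
            units.extend(syllabify(tok))  # rare word: decompose into syllables
    return units

tokens = "the cat sat on the mat the cat ran".split()
print(build_hybrid_units(tokens, min_count=2))
```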

    About Combining Forward and Backward-Based Decoders for Selecting Data for Unsupervised Training of Acoustic Models

    This paper introduces the combination of speech decoders for selecting automatically transcribed speech data for unsupervised training or adaptation of acoustic models. Here, the combination relies on the use of a forward-based and a backward-based decoder. The best performance is achieved when selecting automatically transcribed speech segments that have the same word hypotheses when processed by the Sphinx forward-based and the Julius backward-based transcription systems; this selection process outperforms confidence-measure-based selection. Results are reported and discussed for adaptation and for full training from scratch, using data resulting from various selection processes, whether alone or in addition to the baseline manually transcribed data. Overall, adding the automatically transcribed segments selected by this agreement criterion to the manually transcribed data leads to significant word error rate reductions on the ESTER2 data when compared to the baseline system trained only on manually transcribed speech.
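    A minimal sketch of the agreement-based selection rule, assuming the two decoders' hypothesis lists are aligned per speech segment:

```python
def select_agreeing_segments(forward_hyps, backward_hyps):
    """Return indices of segments where both decoders output the same words."""
    return [i for i, (f, b) in enumerate(zip(forward_hyps, backward_hyps))
            if f.split() == b.split()]

# toy hypotheses from a forward-based and a backward-based decoder
fwd = ["hello world", "speech recognition is fun", "noisy segment here"]
bwd = ["hello world", "speech recognition was fun", "noisy segment here"]
print(select_agreeing_segments(fwd, bwd))  # -> [0, 2]
```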

    Acoustical Frame Rate and Pronunciation Variant Statistics

    Speech technology enables computing statistics on word pronunciation variants, as well as investigating various phonetic phenomena, through forced alignment of large amounts of speech signals with their possible pronunciation variants. Such alignments are usually performed using a 10 ms frame shift in the acoustical analysis. Therefore, the three-emitting-state structure of conventional acoustic hidden Markov models introduces a minimum duration constraint of 30 ms for each phone segment. This constraint is not critical at low speaking rates, but may introduce artefacts at high speaking rates. Thus, this paper investigates the impact of the acoustical frame rate on corpus-based phonetic statistics. Statistics on pronunciation variants obtained with a shorter frame shift (5 ms) are compared to the statistics resulting from the standard 10 ms frame shift. Statistics are computed on a large speech corpus of more than 3 million running words, and are analyzed with respect to the estimated local speaking rate. Results exhibit some discrepancies between the two sets of statistics, in particular at high speaking rates, where the usual 10 ms acoustic analysis frame shift leads to an underestimation of the frequency of the longest pronunciation variants.
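    A toy sketch of the kind of statistics computed above, with illustrative records: tally which pronunciation variant the forced alignment selected for each word, binned by the estimated local speaking rate.

```python
from collections import Counter, defaultdict

# hypothetical alignment records: (word, chosen_variant, phones_per_second)
alignments = [
    ("petite", "p@tit", 9.0), ("petite", "ptit", 16.0),
    ("petite", "p@tit", 11.5), ("petite", "ptit", 14.2),
]

def variant_stats(alignments, rate_threshold=12.0):
    """Count selected pronunciation variants per (word, speaking-rate bucket)."""
    stats = defaultdict(Counter)
    for word, variant, rate in alignments:
        bucket = "fast" if rate >= rate_threshold else "slow"
        stats[(word, bucket)][variant] += 1
    return stats

for key, counts in variant_stats(alignments).items():
    print(key, dict(counts))
```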

    Evaluation of PNCC and extended spectral subtraction methods for robust speech recognition

    This paper evaluates the robustness of different approaches for speech recognition with respect to signal-to-noise ratio (SNR), to signal level, and to the presence of non-speech data before and after the utterances to be recognized. Three types of noise-robust features are considered: Power Normalized Cepstral Coefficients (PNCC), Mel-Frequency Cepstral Coefficients (MFCC) after applying an extended spectral subtraction method, and the embedded denoising features of recent Sphinx versions. Although removing C0 from MFCC-based features leads to a slight decrease in speech recognition performance, it makes the speech recognition system independent of the speech signal level. With multi-condition training, the three sets of noise-robust features behave rather similarly with respect to SNR and to the presence of non-speech data. Overall, the best performance is achieved with the extended spectral subtraction approach. Also, the performance of the PNCC features appears to depend on the initialization of the normalization factor.
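    For context, a minimal numpy sketch of basic power spectral subtraction, the family of methods the extended variant above belongs to; the over-subtraction factor `alpha` and spectral floor `beta` are illustrative values, not the paper's settings.

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power, alpha=2.0, beta=0.01):
    """Estimate the clean power spectrum: subtract an over-estimated noise
    power spectrum, then floor the result to avoid negative values."""
    clean = noisy_power - alpha * noise_power
    return np.maximum(clean, beta * noisy_power)

noisy = np.abs(np.fft.rfft(np.random.randn(256))) ** 2   # toy noisy frame
noise = np.full_like(noisy, noisy.mean() * 0.1)          # assumed noise estimate
print(spectral_subtraction(noisy, noise)[:5])
```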